Reproduced from PCA using Python (scikit-learn)

PCA for Machine Learning¶

One of the most important applications of PCA is speeding up machine learning algorithms. The MNIST database of handwritten digits is well suited for demonstrating this, as it has 784 feature columns (784 dimensions), a training set of 60,000 examples, and a test set of 10,000 examples.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px  # if you don't have plotly, install it with "pip install plotly"

#model validation
from sklearn.model_selection import train_test_split

# PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings("ignore")

Download and Load the (image) Data¶

In [2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
In [3]:
mnist.data.shape
Out[3]:
(70000, 784)
In [4]:
mnist.data.head()
Out[4]:
pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 ... pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 784 columns

In [5]:
mnist.target.head()
Out[5]:
0    5
1    0
2    4
3    1
4    9
Name: class, dtype: category
Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']

The downloaded images are contained in mnist.data, which has a shape of (70000, 784), meaning there are 70,000 images, each described by 784 features (the pixels of a 28 x 28 image). The labels, the digits 0–9, are contained in mnist.target (stored as strings, hence the category dtype above).

The task is to predict which digit, 0 through 9, each image represents.
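Each 784-dimensional row is a flattened 28 x 28 image, so any sample can be reshaped and displayed. A minimal sketch, using a synthetic row in place of a real MNIST sample:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for one row of mnist.data: 784 pixel intensities in 0-255
row = np.arange(784) % 256

image = row.reshape(28, 28)   # undo the flattening: 784 values -> 28 x 28 grid
print(image.shape)            # (28, 28)

plt.imshow(image, cmap='gray')  # render the pixel grid as a grayscale image
plt.axis('off')
```

With the real data, the first training image can be displayed the same way via `mnist.data.iloc[0].to_numpy().reshape(28, 28)`.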

Split Data into Training and Test Sets¶

In [6]:
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=1/7.0, random_state=0)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(60000, 784) (60000,) (10000, 784) (10000,)

Standardize the Data¶

In [7]:
# standardizing X features
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [8]:
X_train_scaled[:2]
Out[8]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
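StandardScaler learns each column's mean and standard deviation from the training set only and reuses those statistics on the test set, which keeps information from the test data out of the preprocessing. A quick sketch on toy data (synthetic arrays standing in for the MNIST pixels):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for X_train and X_test: 3 features with non-zero mean and spread
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training data only
test_scaled = scaler.transform(test)        # reuse the training statistics

print(np.allclose(train_scaled.mean(axis=0), 0.0, atol=1e-10))  # True
print(np.allclose(train_scaled.std(axis=0), 1.0, atol=1e-10))   # True
```

The zeros in the output above are expected: many MNIST border pixels are 0 in every training image, and StandardScaler leaves such zero-variance columns at 0 rather than dividing by zero.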

Import and Apply PCA¶

Notice that the code below passes 0.95 for the n_components parameter. This tells scikit-learn to choose the minimum number of principal components such that 95% of the variance is retained.

In [9]:
from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA(n_components=0.95)
pca
Out[9]:
PCA(n_components=0.95)
In [10]:
# fit_transform X_train_scaled
X_train_pca = pca.fit_transform(X_train_scaled)
# Transform X_test_scaled
X_test_pca = pca.transform(X_test_scaled)
In [11]:
X_train_pca[:2]
Out[11]:
array([[-3.69763425e+00,  9.66873129e+00, -1.90737826e+00,
        -1.22670223e-01,  4.41010917e-02,  2.98328839e-01,
         3.86299273e+00,  2.51682777e+00,  6.44844123e-01,
         5.41619883e-01,  3.84999461e+00,  2.91186643e+00,
        -5.05153823e-01,  2.44849949e+00, -3.02270936e+00,
        -7.92215630e-01, -1.32816890e+00, -3.59872442e+00,
        -1.70405773e+00, -1.91267182e+00, -1.25668348e+00,
         8.75168380e-01, -1.91217173e+00, -2.89178760e+00,
        -6.22917826e-02,  4.97582943e-01, -7.14058797e-01,
        -7.39717894e-01,  6.12933833e-01,  5.36713209e-01,
         5.00520029e-01,  2.58296531e+00, -1.92330147e+00,
         1.08235587e+00,  1.92275199e+00, -1.15752056e-01,
         1.20054950e+00,  9.80943366e-01,  5.37792412e+00,
         1.41227982e+00, -2.24208679e-01, -5.83848506e-01,
        -4.29248691e-01,  6.77860026e-01,  3.02350882e-02,
        -7.68628687e-01, -8.29506788e-02,  1.05350508e+00,
        -1.58763756e+00,  1.85872520e-01, -7.20610725e-01,
        -5.79762108e-01,  3.53868553e+00, -4.18318009e-02,
         1.27255296e+00,  6.71879440e-01, -5.01685632e-01,
         8.16094272e-01,  1.27147719e+00,  3.98537690e-01,
         1.84670491e+00, -4.91293526e-01, -7.10745662e-01,
        -2.32278400e+00, -9.33476791e-01, -7.03950887e-01,
         7.86131063e-01, -3.98334305e-01,  1.26975589e+00,
        -1.29746355e-01, -1.25002766e+00,  5.00394913e-01,
        -9.18535357e-02,  3.75441709e-01, -8.61118722e-01,
         8.53854364e-02,  1.01041460e-01,  1.05489514e+00,
        -9.17144208e-01, -5.33112522e-02,  3.85633128e-01,
         5.05493087e-01, -9.70612336e-01,  3.97882654e-02,
        -8.70121309e-01,  5.52238920e-01,  6.42248788e-01,
         1.86629164e-01,  4.11115005e-01, -1.80872451e+00,
        -9.97270428e-01,  1.40651663e+00,  1.09438961e+00,
         5.67033244e-01, -2.41621416e+00,  8.32614329e-01,
        -2.80307451e-01, -8.95518212e-01, -7.58902803e-01,
         9.38032711e-01, -1.63971751e+00, -7.01887770e-01,
        -1.30535167e+00,  1.38162250e+00,  2.53087269e-03,
         8.38859168e-03,  3.45930253e-01,  2.27383500e-01,
        -5.97345324e-01,  6.59823300e-01,  6.14733534e-01,
         6.78958193e-01, -2.11998522e-01,  4.57010506e-01,
         2.54041854e-01, -2.58157433e-01,  7.52776565e-01,
        -5.89174611e-01,  1.01494437e+00,  1.25212894e+00,
         1.20114463e+00,  1.57510697e+00, -3.14391458e-01,
        -8.46679571e-01,  2.11994129e-01, -2.53670677e-01,
         1.89062982e+00,  1.18852234e+00, -8.26606387e-01,
         1.84072021e+00,  3.72221551e-01,  1.59772133e+00,
        -1.34195716e-01, -3.84600816e-01,  3.74850424e-01,
        -1.92923426e-01,  3.17048192e-01,  4.33978052e-01,
         7.40924889e-02, -2.13572807e-01, -8.18561106e-01,
        -9.47547651e-01, -1.27718173e-01,  7.38452413e-01,
         8.19264011e-02,  1.27139902e+00,  2.99632772e-01,
         3.87116082e-01, -2.56949415e-01, -5.58082610e-01,
         5.84717319e-01, -3.81235443e-01, -2.36741599e-01,
         7.73636924e-01, -1.19058723e+00,  9.96191645e-01,
        -4.19727249e-01, -4.13422219e-01,  2.03904406e-01,
        -6.18053895e-01, -6.29408308e-03,  1.87035597e-01,
        -4.65746884e-01,  3.92027753e-01, -5.99957163e-01,
         1.12353399e+00, -9.34426878e-02, -3.84795106e-01,
         3.92248789e-01,  3.12060430e-01, -4.93540307e-01,
        -3.28514925e-01, -1.33236152e+00, -2.11911795e-02,
        -5.21907920e-02,  1.81688199e-01, -6.11998915e-01,
        -1.37675331e-02,  3.66736187e-01,  1.69483495e+00,
         3.58042538e-01,  9.15914185e-02,  4.54110926e-01,
        -3.01956357e-01,  5.41002312e-01,  7.48957594e-02,
         7.22913541e-01, -4.17154684e-01, -4.97951292e-01,
        -4.52447259e-01, -8.28648605e-03, -5.49981448e-01,
         1.55790398e-01,  1.40559885e+00, -3.04013740e-01,
        -5.59478645e-01, -2.98812364e-01,  7.49180716e-01,
        -5.82953885e-01,  5.48395485e-01,  3.51077373e-01,
        -3.00096431e-01,  3.54962341e-01, -7.04468564e-01,
        -3.38453356e-01, -1.02500213e+00, -7.32388382e-02,
        -3.26750499e-01,  2.26523175e-01, -3.25279284e-01,
        -3.71359512e-01, -4.71265647e-01, -3.98457841e-01,
        -6.18020838e-01, -1.47797409e-01, -5.64329533e-01,
         6.91288566e-01,  9.25511021e-01,  6.99112364e-01,
        -2.49128229e-01, -3.90983946e-01, -2.91213427e-02,
        -1.23602751e-01,  1.15017006e-01, -3.43353461e-01,
         3.01568876e-02, -5.71264681e-01,  1.16832437e+00,
        -1.52287439e-01, -5.61048581e-01, -4.80145744e-01,
         2.49862197e-01,  6.89454240e-01,  7.58611967e-01,
         7.19526115e-01,  2.04154692e-01, -1.06929660e-01,
        -4.39690422e-02, -1.12783291e-02,  5.52197472e-01,
         1.30518936e-01, -8.88479849e-02, -4.43622483e-01,
        -1.04825141e+00, -5.95393443e-01,  1.91938942e-01,
         1.10488829e+00, -4.71001204e-01, -5.12447433e-01,
         6.63320318e-01, -5.04613681e-01,  1.00198952e+00,
        -9.43083103e-02,  4.94716286e-01,  7.54437410e-01,
        -9.64151600e-02, -6.82325745e-01, -5.94185347e-01,
         2.45722963e-01, -9.09393146e-01, -1.01393210e-01,
         1.73005250e-01, -5.06936674e-02, -4.49378234e-01,
        -1.86476740e-01,  1.71900343e-01,  6.06925364e-01,
         2.86472365e-01, -1.99174070e-01, -3.62916074e-02,
         6.29023764e-01, -5.82448068e-01,  2.52705127e-01,
        -1.99046569e-01,  7.83733156e-01, -1.65817633e-01,
         1.12197336e-02, -3.19875487e-01,  2.26507500e-01,
         3.52379176e-01,  3.35868411e-01, -7.73696672e-01,
         6.88901318e-01,  4.00213907e-01,  3.47667911e-01,
         1.34262929e-01,  5.38501839e-01,  2.68913388e-01,
        -1.93760942e-02,  1.73380635e-01, -6.63154946e-01,
         1.00290172e-01, -8.69539051e-01,  8.14909039e-01,
         1.51486658e-01,  5.34396553e-01,  2.48565308e-01,
        -2.33118576e-01,  4.14477180e-01, -3.25900809e-01,
         6.18608220e-02,  3.22675042e-01, -2.61450815e-01,
         5.02124426e-01,  1.30123428e+00,  1.23286309e-01,
         1.79179343e-01, -1.36958336e+00,  2.80910063e-01,
        -3.52841936e-02, -4.35143375e-01,  4.51586247e-01,
        -1.96588261e-01, -5.25249810e-01,  9.08694760e-01,
        -3.83513231e-01,  4.43919496e-01,  5.54583333e-01,
        -6.92309914e-02, -3.01448769e-02, -3.60625023e-01,
        -2.03582179e-01,  4.71459001e-02,  1.63980394e-01,
         1.83551695e-01, -2.81001667e-01, -1.84475487e-02],
       [-1.07779826e+00, -8.77945861e-01,  4.44723208e+00,
        -2.77391332e+00, -1.16545482e+00, -6.32137404e+00,
        -1.71336991e-01,  6.82931785e+00,  5.32877776e-01,
         3.55879634e+00,  6.46301177e-01,  6.41851358e-01,
        -8.06364739e-02, -1.11864848e+00,  2.41509750e+00,
        -2.72656500e+00,  2.42922580e+00, -2.92563961e+00,
         1.84466398e+00,  1.66633750e+00,  1.57389052e-01,
         1.49147942e+00,  9.76808817e-01, -1.42422748e+00,
        -1.00422598e+00, -1.21344648e-01,  1.77558131e+00,
        -1.42563110e+00, -2.66096376e-02, -2.46661446e+00,
        -1.00086878e+00, -2.04432037e-02, -7.85984742e-01,
        -1.44137453e+00, -4.08563038e-01,  2.53610523e+00,
         3.72478329e-02,  3.16006372e-01,  6.01300006e-01,
        -1.49875136e+00, -3.47558937e-02,  3.16231288e-01,
         6.53996650e-01,  1.02985467e+00,  1.77956641e+00,
        -1.01527374e+00,  6.81344094e-01, -2.90201486e-01,
         6.90438861e-01, -1.78693921e+00,  1.39116202e+00,
        -4.93238188e-01,  8.88235645e-01, -9.18497048e-01,
         6.97789307e-02, -2.56416665e+00, -1.34936484e+00,
         8.06544278e-02, -7.47587716e-01, -6.38759124e-01,
         9.27740978e-01, -4.79495626e-01, -2.40834883e+00,
        -1.53336973e+00,  3.47415800e-01, -1.32958814e+00,
         1.70656224e+00,  7.57479476e-01, -1.45234542e-02,
        -2.22994173e-01,  1.25185073e+00,  5.64711644e-02,
         1.42704950e+00, -3.36147530e-01,  2.87309281e-01,
         1.21252918e+00, -7.53265785e-01, -2.87774681e-01,
         1.46600249e+00,  6.02616026e-01, -7.43238937e-01,
        -5.26073379e-02,  5.84795695e-01, -4.49227160e-01,
        -4.74868132e-02,  4.07550825e-01, -3.51069620e-01,
         1.36148292e-01,  3.02680567e-01,  2.46694493e-01,
        -6.12938344e-01,  4.67901331e-01,  2.28483727e-01,
        -6.79825494e-01, -3.56131066e-01,  1.23329448e-01,
        -2.60719086e-01,  9.10504563e-02, -5.67256620e-01,
        -7.45005540e-02, -7.53397668e-02,  3.06134448e-01,
         9.89114007e-01, -1.27886211e+00, -1.09616897e+00,
        -1.04454073e+00, -8.58536051e-01, -3.39410036e-01,
        -2.16525047e-01, -9.47975688e-01, -6.19065955e-02,
         7.13206979e-02,  3.60131312e-01, -5.94026566e-02,
         9.95787014e-01, -1.94097484e-02, -5.23709708e-01,
        -3.29177280e-01,  9.48508579e-02, -9.64972775e-01,
        -1.11370625e-01,  2.08575948e-01,  8.96528568e-01,
         5.69140416e-01,  5.63926737e-01, -4.10251047e-02,
        -9.40250902e-02,  1.26149991e+00,  2.60175762e-01,
        -4.00966461e-01, -9.25326763e-01,  1.62721154e-01,
        -8.43869406e-02, -5.03711478e-01, -1.72489867e-01,
        -2.81322831e-01,  7.01745570e-02,  1.14223970e+00,
        -6.20924255e-01, -3.39714598e-01,  2.71843592e-01,
        -1.10850926e+00, -7.84634021e-02,  4.48970201e-01,
        -1.19851621e-01, -9.69932235e-02, -7.69166106e-02,
        -1.49409132e-01,  1.45702937e-01,  5.40168256e-02,
         5.99550060e-01, -3.05152002e-01, -1.77562768e-01,
         7.81935007e-01, -3.42807093e-01, -3.50230181e-01,
        -5.66661461e-01,  2.45903422e-01, -6.94091296e-01,
         9.78381861e-02,  3.84246653e-02,  6.06494638e-02,
        -2.66094512e-02,  3.11253556e-02, -3.44250149e-01,
         4.00196416e-01, -8.22437165e-02,  4.90936149e-01,
        -6.50422936e-01, -3.41364250e-02, -1.03807163e+00,
         1.74804783e-01,  3.14520770e-01, -1.53543332e+00,
         6.19277211e-01, -1.92238051e-01,  4.27282910e-02,
        -1.48656662e-01,  5.29390960e-01,  1.16165712e+00,
         9.73005571e-01, -2.68404237e-01, -2.68100634e-01,
         1.29374143e-01, -4.38546256e-02, -3.05756349e-01,
         3.27097765e-01, -4.31104185e-02, -5.50280889e-01,
         4.65746139e-01, -4.80938304e-01, -1.99663688e-01,
         2.38407837e-01,  4.32138006e-01, -5.12684528e-01,
         3.27793632e-01, -5.20959386e-01,  2.27099932e-02,
        -3.97447903e-01, -4.24002423e-01, -1.55295091e-02,
         4.07072022e-01,  6.95452549e-01, -1.85405259e-01,
        -4.55870419e-01, -5.85169871e-02, -2.47808983e-01,
         6.40285313e-01,  6.98139373e-01, -1.91535783e-01,
         5.02821115e-01,  3.74391479e-01,  5.01994602e-01,
        -1.22214115e-01,  7.06290586e-02, -3.89064327e-01,
         1.64498892e-01, -2.30673979e-01,  1.75035594e-01,
        -1.98929197e-01,  1.69619116e-01, -7.19352873e-02,
        -1.77569204e-01, -2.56335139e-01,  1.86818265e-01,
         2.81601585e-03,  1.26098805e-01, -4.64418935e-01,
         2.43027145e-01, -4.36579803e-01,  3.35214372e-01,
        -4.95290205e-01, -5.44834314e-02, -6.19935493e-01,
        -1.31821214e-01,  2.39386316e-01, -7.66901494e-01,
         3.29733761e-01,  8.80970106e-01,  1.76575241e-01,
        -2.27257828e-01, -9.24852122e-03, -8.68463327e-01,
         1.68204999e-01, -2.16721607e-01,  2.38807154e-01,
         4.15110412e-01,  1.83188157e-01,  5.88974588e-01,
        -2.72385556e-01, -7.47886258e-01, -3.49193394e-01,
         5.77971878e-01, -2.33229339e-01, -2.86429735e-01,
         4.21070813e-01, -2.38266464e-01,  1.71355348e-01,
         4.42613313e-01, -5.28301742e-01,  2.04770549e-01,
        -1.71280596e-01,  2.28879024e-01, -5.53069879e-02,
        -7.11703493e-01, -4.17441724e-01,  2.66669184e-01,
         5.31896736e-01, -4.69062414e-01, -1.84294909e-01,
        -3.89217139e-01, -1.67179282e-02, -2.87548851e-01,
         5.62572893e-01,  2.37056355e-01,  4.95018721e-01,
         1.83273571e-01,  1.89473658e-01, -2.68462667e-01,
        -2.60569068e-01,  4.94808808e-01,  4.17521758e-01,
        -3.28425151e-02, -2.91730444e-02,  3.39877980e-02,
         1.22041825e-01, -3.35084344e-02,  5.11646093e-01,
        -1.61283430e-01,  3.16397936e-01, -2.14923027e-01,
         4.27158672e-02, -5.57453393e-01,  4.25412544e-01,
        -1.20992503e-01, -4.01290217e-01, -3.30216390e-01,
         1.44209284e-01,  6.54878719e-02,  5.73406279e-01,
         8.81378258e-02,  1.72962775e-01, -4.25920768e-01,
         7.10080989e-02, -3.38467222e-02,  1.12092468e-02,
        -3.05667690e-01, -6.98433523e-01, -2.38389939e-01,
        -6.24484984e-01, -6.35919541e-01, -3.88307573e-01,
         4.48825666e-01,  7.61666914e-01, -9.70046057e-02,
         2.28635261e-01, -2.78969237e-02,  1.50824218e-01,
        -8.38957718e-02, -4.62220199e-01,  9.58914430e-03,
         3.06971136e-01,  2.77155055e-01,  1.42789759e-01,
         3.27869318e-01, -1.61930473e-01, -1.44981460e-01]])
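Because roughly 95% of the variance is retained, the compressed representation can be mapped back to the original feature space with `inverse_transform`, losing at most about 5% of the variance. A sketch on toy correlated data (standing in for the scaled pixels):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy correlated data standing in for X_train_scaled
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50)) @ rng.normal(size=(50, 50))

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)             # project onto the leading components
X_approx = pca.inverse_transform(X_reduced)  # map back to the original 50-dim space

# Relative reconstruction error equals the discarded variance fraction
err = np.linalg.norm(X - X_approx) ** 2 / np.linalg.norm(X - X.mean(axis=0)) ** 2
print(err < 0.051)  # True: at most ~5% of the variance was discarded
```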
In [12]:
# Check the number of components that have been created
pca.n_components_
Out[12]:
327
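The chosen count can be cross-checked against the cumulative explained variance: it is the smallest k whose running total exceeds 0.95, the same rule scikit-learn applies when n_components is a float. A sketch on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy correlated data standing in for the scaled training set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))

# Fit with all components, then take the running total of variance ratios
cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# Smallest k whose cumulative explained variance exceeds 95%
k = int(np.searchsorted(cumvar, 0.95, side='right')) + 1

print(k == PCA(n_components=0.95).fit(X).n_components_)  # True
```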
In [13]:
# explained variance of each component
print(pca.explained_variance_ratio_) 
[0.05685361 0.04063911 0.03763558 0.02922128 0.02528914 0.02204721
 0.01928718 0.01757131 0.01540722 0.01405428 0.01350938 0.01211403
 0.01119986 0.01098358 0.01033802 0.01003988 0.00936712 0.00925502
 0.00896537 0.00872971 0.00829055 0.00803682 0.00768332 0.00745709
 0.00721007 0.00696055 0.0068921  0.00665777 0.00632132 0.00617711
 0.0060383  0.00592983 0.00572632 0.00570209 0.00566881 0.00558128
 0.00537305 0.00532854 0.00518324 0.00511526 0.00486119 0.00478035
 0.00475297 0.0045825  0.00449905 0.004486   0.00442056 0.0044048
 0.00432475 0.00430896 0.00417871 0.00403848 0.00401904 0.00392742
 0.00390669 0.00387445 0.00375632 0.00371195 0.00368406 0.00363915
 0.00354918 0.00348922 0.0034752  0.00344278 0.00340872 0.0033679
 0.00331969 0.0032352  0.00320347 0.0031613  0.00312554 0.00309227
 0.00307597 0.00304839 0.00302824 0.00298684 0.00296713 0.00293995
 0.00291154 0.00289588 0.00285848 0.0028449  0.00281628 0.00280938
 0.00280373 0.00279133 0.00276583 0.00274974 0.00270938 0.0026828
 0.00266616 0.00263939 0.00262498 0.00261941 0.00259811 0.00253579
 0.00252765 0.00251258 0.00247512 0.00245186 0.00241783 0.00239157
 0.0023698  0.0023236  0.0023037  0.00226846 0.00225265 0.00223827
 0.0022208  0.0021785  0.0021765  0.0021626  0.00214873 0.00212475
 0.00207814 0.00206356 0.00205832 0.00203387 0.00202009 0.0019781
 0.00196711 0.00195063 0.00192362 0.00192114 0.0019056  0.0018947
 0.00187839 0.00186145 0.00184124 0.00181677 0.00179847 0.00178707
 0.00176135 0.00175645 0.00171728 0.00168586 0.00168432 0.00167564
 0.0016547  0.00165256 0.0016285  0.00161405 0.00159241 0.00157637
 0.00157156 0.00155893 0.00155016 0.00154756 0.00152067 0.00150921
 0.00149016 0.00148562 0.00147618 0.00146899 0.00144641 0.00144071
 0.00143433 0.00142855 0.00141384 0.00141068 0.00140292 0.00140018
 0.00139898 0.00139464 0.00137892 0.00137346 0.0013652  0.00135667
 0.00135119 0.00134854 0.00133391 0.00131378 0.00130917 0.00130138
 0.0012963  0.00127563 0.00127129 0.0012565  0.00125098 0.00122937
 0.00121937 0.00121567 0.00120144 0.00119766 0.00118574 0.00117728
 0.00116681 0.00115998 0.00115004 0.00113061 0.0011206  0.00111867
 0.00111006 0.00108476 0.00107266 0.00106136 0.00105619 0.00104969
 0.00104495 0.00103752 0.00102513 0.00100979 0.0010048  0.00098798
 0.00098294 0.00097809 0.00097405 0.00097018 0.00095418 0.00094977
 0.00094014 0.00093501 0.00091285 0.00090861 0.00090083 0.00089101
 0.00088459 0.00087588 0.00087477 0.00086066 0.00085789 0.00084811
 0.00084077 0.00083403 0.00082954 0.00082379 0.00081009 0.00080765
 0.00079963 0.00079619 0.00078925 0.00078487 0.00078071 0.00077177
 0.00076283 0.00075984 0.0007443  0.00074245 0.00073157 0.0007292
 0.00072189 0.00070632 0.00070113 0.00070024 0.00069278 0.00068906
 0.00068798 0.00068534 0.00067886 0.00066675 0.0006589  0.00065724
 0.00065009 0.00064184 0.00063892 0.00063609 0.00062879 0.00062614
 0.00062108 0.00061539 0.00061137 0.00060608 0.00060376 0.00059921
 0.00059009 0.00058542 0.00058427 0.00057712 0.00057365 0.00056997
 0.00056079 0.00055679 0.00055437 0.00055061 0.00054566 0.00054137
 0.00054078 0.00053512 0.0005325  0.00052606 0.00051768 0.00051541
 0.00051239 0.00050887 0.00050629 0.0005035  0.000501   0.00049528
 0.00049104 0.00048658 0.00048374 0.00048201 0.00047849 0.0004676
 0.00046477 0.00046048 0.00045276 0.00044635 0.00044339 0.00044194
 0.00043658 0.00043598 0.00043443 0.00043282 0.0004252  0.00042188
 0.00041866 0.00041698 0.00041243 0.00041144 0.00040687 0.0004032
 0.00039699 0.00039519 0.00039201 0.0003893  0.00038551 0.00038304
 0.00038233 0.00037827 0.00037664 0.0003738  0.00037281 0.00036816
 0.00036678 0.00036217 0.00036144]
In [14]:
# cumulative sum
print(pca.explained_variance_ratio_.cumsum())
[0.05685361 0.09749272 0.1351283  0.16434958 0.18963872 0.21168593
 0.2309731  0.24854441 0.26395164 0.27800592 0.2915153  0.30362934
 0.3148292  0.32581277 0.33615079 0.34619067 0.35555779 0.36481281
 0.37377818 0.38250789 0.39079844 0.39883526 0.40651858 0.41397567
 0.42118575 0.4281463  0.4350384  0.44169618 0.4480175  0.45419461
 0.46023291 0.46616274 0.47188906 0.47759115 0.48325996 0.48884124
 0.49421429 0.49954283 0.50472607 0.50984133 0.51470252 0.51948287
 0.52423584 0.52881834 0.53331739 0.53780339 0.54222395 0.54662875
 0.5509535  0.55526246 0.55944116 0.56347964 0.56749868 0.5714261
 0.5753328  0.57920725 0.58296357 0.58667551 0.59035958 0.59399873
 0.59754791 0.60103713 0.60451233 0.60795511 0.61136383 0.61473173
 0.61805142 0.62128662 0.62449009 0.62765139 0.63077693 0.63386919
 0.63694516 0.63999355 0.64302179 0.64600863 0.64897576 0.65191571
 0.65482725 0.65772313 0.66058161 0.66342651 0.6662428  0.66905217
 0.67185591 0.67464724 0.67741307 0.68016281 0.68287218 0.68555498
 0.68822115 0.69086054 0.69348552 0.69610492 0.69870303 0.70123882
 0.70376647 0.70627905 0.70875416 0.71120603 0.71362385 0.71601542
 0.71838522 0.72070882 0.72301252 0.72528098 0.72753363 0.7297719
 0.7319927  0.7341712  0.7363477  0.73851029 0.74065902 0.74278378
 0.74486192 0.74692548 0.74898381 0.75101768 0.75303776 0.75501586
 0.75698297 0.7589336  0.76085722 0.76277836 0.76468396 0.76657867
 0.76845706 0.77031851 0.77215976 0.77397652 0.77577499 0.77756206
 0.77932341 0.78107986 0.78279714 0.784483   0.78616731 0.78784295
 0.78949765 0.79115021 0.79277871 0.79439276 0.79598518 0.79756154
 0.7991331  0.80069203 0.80224219 0.80378975 0.80531042 0.80681963
 0.80830979 0.80979542 0.8112716  0.81274059 0.814187   0.81562771
 0.81706204 0.8184906  0.81990444 0.82131512 0.82271804 0.82411822
 0.8255172  0.82691183 0.82829075 0.82966421 0.8310294  0.83238608
 0.83373727 0.83508581 0.83641972 0.8377335  0.83904267 0.84034405
 0.84164035 0.84291598 0.84418727 0.84544377 0.84669475 0.84792412
 0.84914349 0.85035916 0.85156061 0.85275827 0.85394401 0.85512129
 0.8562881  0.85744809 0.85859812 0.85972873 0.86084933 0.861968
 0.86307806 0.86416282 0.86523548 0.86629684 0.86735303 0.86840273
 0.86944767 0.87048519 0.87151032 0.87252011 0.87352492 0.8745129
 0.87549584 0.87647393 0.87744798 0.87841816 0.87937234 0.88032211
 0.88126225 0.88219726 0.88311011 0.88401872 0.88491955 0.88581056
 0.88669515 0.88757103 0.8884458  0.88930647 0.89016436 0.89101247
 0.89185324 0.89268727 0.89351681 0.8943406  0.89515069 0.89595835
 0.89675797 0.89755417 0.89834341 0.89912829 0.899909   0.90068077
 0.90144361 0.90220344 0.90294775 0.9036902  0.90442177 0.90515097
 0.90587285 0.90657917 0.9072803  0.90798055 0.90867332 0.90936239
 0.91005036 0.91073571 0.91141456 0.91208132 0.91274022 0.91339746
 0.91404755 0.91468938 0.9153283  0.9159644  0.91659319 0.91721933
 0.9178404  0.91845579 0.91906716 0.91967323 0.920277   0.92087621
 0.9214663  0.92205172 0.92263599 0.92321311 0.92378677 0.92435674
 0.92491752 0.92547431 0.92602868 0.92657929 0.92712496 0.92766633
 0.9282071  0.92874222 0.92927472 0.92980078 0.93031846 0.93083387
 0.93134626 0.93185513 0.93236142 0.93286493 0.93336592 0.93386121
 0.93435225 0.93483883 0.93532257 0.93580459 0.93628308 0.93675067
 0.93721545 0.93767593 0.93812869 0.93857504 0.93901843 0.93946037
 0.93989695 0.94033294 0.94076737 0.94120019 0.94162539 0.94204727
 0.94246593 0.94288291 0.94329533 0.94370677 0.94411364 0.94451684
 0.94491383 0.94530903 0.94570104 0.94609034 0.94647585 0.94685889
 0.94724122 0.94761949 0.94799613 0.94836993 0.94874274 0.9491109
 0.94947768 0.94983985 0.95020129]
In [15]:
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

total_var = pca.explained_variance_ratio_.sum() * 100

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={"x": "# Components", "y": "Explained Variance"}
)

Apply Logistic Regression to the Transformed Data¶

In [16]:
# initialize the LogisticRegression algorithm
logisticRegr = LogisticRegression(solver='lbfgs', max_iter=2000, random_state=0)

# fit
logisticRegr.fit(X_train_pca, y_train)
Out[16]:
LogisticRegression(max_iter=2000, random_state=0)
In [17]:
# model's overall accuracy

print(metrics.accuracy_score(y_test, logisticRegr.predict(X_test_pca)))
0.9185
In [18]:
# model's confusion matrix
print(metrics.confusion_matrix(y_test, logisticRegr.predict(X_test_pca)))
[[ 965    0    2    3    1   10   10    0    4    1]
 [   0 1106   11    1    1    7    0    4    9    2]
 [   3   14  932   19   13    4   13   12   27    3]
 [   2    7   39  889    1   29    1   12   20   13]
 [   1    3    8    0  900    0   11    8    4   27]
 [   7    2   10   29    7  759   14    3   27    5]
 [   8    3    9    0   14   14  935    1    5    0]
 [   4    5   16    2   12    5    0  976    5   39]
 [   3   19    9   19    7   26    6    2  858   14]
 [   4    5    4   11   30    9    1   33    7  865]]
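The diagonal of the confusion matrix holds the correctly classified counts, so per-class recall and overall accuracy fall out of it directly. A small sketch with a hypothetical 3-class matrix:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted)
cm = np.array([[50,  2,  3],
               [ 4, 40,  6],
               [ 1,  5, 44]])

per_class_recall = cm.diagonal() / cm.sum(axis=1)  # correct / total per true class
overall_accuracy = cm.trace() / cm.sum()           # all correct / all samples

print(per_class_recall.round(3).tolist())   # [0.909, 0.8, 0.88]
print(round(float(overall_accuracy), 4))    # 0.8645
```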
In [19]:
import scikitplot as skplt

skplt.metrics.plot_confusion_matrix(y_true=y_test, 
                                    y_pred=logisticRegr.predict(X_test_pca))
plt.show()

Timing of Fitting Logistic Regression after PCA¶

The whole point of this section of the tutorial was to show that you can use PCA to speed up the fitting of machine learning algorithms. The table below shows how long it took to fit logistic regression on the author's MacBook after using PCA (retaining different amounts of variance each time).
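A simple way to reproduce such timings is to wrap `fit` in a timer and compare the full feature matrix against the PCA-reduced one. A sketch on synthetic data with hypothetical sizes (the real comparison would use `X_train_scaled` and `X_train_pca`):

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled training data: 1000 samples, 100 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100)) @ rng.normal(size=(100, 100))
y = rng.integers(0, 10, size=1000)

def timed_fit(features):
    """Fit logistic regression and return the elapsed wall-clock time."""
    model = LogisticRegression(solver='lbfgs', max_iter=500, random_state=0)
    start = time.perf_counter()
    model.fit(features, y)
    return time.perf_counter() - start

t_full = timed_fit(X)
X_reduced = PCA(n_components=0.95).fit_transform(X)
t_reduced = timed_fit(X_reduced)
print(f"full ({X.shape[1]} features): {t_full:.3f}s, "
      f"reduced ({X_reduced.shape[1]} components): {t_reduced:.3f}s")
```

Exact numbers depend on the machine and data; the point is that fewer input dimensions generally means less work per optimizer iteration.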